Statistical Analysis of Vietnamese Dialect Corpus and Dialect Identification Experiments

Authors

  • Pham Ngoc Hung
  • Trinh Van Loan
  • Nguyen Hong Quang
Abstract

Speech recognition performance improves when the corpus is organized for a specialized domain and applied consistently to recognition in specific situations. Vietnamese has a variety of dialects. Building a Vietnamese dialect corpus is the first step toward a dialect identification system, which in turn can improve Vietnamese speech recognition in general. This paper presents a method for building a corpus for Vietnamese dialect identification. The Vietnamese corpus VDSPEC is built with topic-based recordings and tonal balance, and its total duration is 45.12 hours. The basic characteristics and preliminary evaluations of the corpus are also described. Statistical analysis of F0 variation and dialect classification experiments using LDA projection show that the pronunciation modality of Vietnamese differs among the three dialects of Hanoi, Hue, and Ho Chi Minh City. In the dialect identification experiments, the first four formants, their bandwidths, and F0 variants were used as input parameters for a GMM. On the Vietnamese dialect corpus, the recognition rate is 66.3% without F0 information and rises to 72.2% with F0 information.
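The abstract does not state the number of mixture components, the training procedure, or the decision rule. As a minimal sketch only, the following assumes a common setup: one diagonal-covariance GMM per dialect, trained with EM, with an utterance assigned to the dialect whose model gives the highest average frame log-likelihood. The feature vectors are synthetic placeholders standing in for the formant, bandwidth, and F0 parameters (they are not VDSPEC data), and all function names are hypothetical.

```python
import math
import random

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def logsumexp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    top = max(vals)
    return top + math.log(sum(math.exp(v - top) for v in vals))

def fit_gmm(data, k, iters=30, seed=0, var_floor=1e-3):
    """Fit a k-component diagonal GMM with EM; returns (weights, means, variances)."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    means = [list(x) for x in rng.sample(data, k)]  # init means from random frames
    gmean = [sum(x[d] for x in data) / n for d in range(dim)]
    gvar = [max(sum((x[d] - gmean[d]) ** 2 for x in data) / n, var_floor)
            for d in range(dim)]
    variances = [list(gvar) for _ in range(k)]      # init with global variance
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each frame
        resp = []
        for x in data:
            logs = [math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                    for j in range(k)]
            norm = logsumexp(logs)
            resp.append([math.exp(l - norm) for l in logs])
        # M-step: re-estimate weights, means, and floored variances
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-10
            weights[j] = nj / n
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
                        for d in range(dim)]
            variances[j] = [max(sum(r[j] * (x[d] - means[j][d]) ** 2
                                    for r, x in zip(resp, data)) / nj, var_floor)
                            for d in range(dim)]
    return weights, means, variances

def avg_loglik(frames, gmm):
    """Average per-frame log-likelihood of an utterance under one GMM."""
    weights, means, variances = gmm
    total = sum(logsumexp([math.log(w) + log_gauss_diag(x, m, v)
                           for w, m, v in zip(weights, means, variances)])
                for x in frames)
    return total / len(frames)

def identify(frames, models):
    """Pick the dialect whose GMM scores the utterance highest."""
    return max(models, key=lambda name: avg_loglik(frames, models[name]))

def synth_frames(center, n, rng):
    """Synthetic stand-in for per-frame features (NOT real corpus data)."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]

rng = random.Random(42)
# Hypothetical, well-separated cluster centers chosen only for illustration.
centers = {"Hanoi": [0.0, 0.0, 0.0],
           "Hue": [4.0, 4.0, 0.0],
           "Ho Chi Minh City": [-4.0, 4.0, 2.0]}
models = {name: fit_gmm(synth_frames(c, 200, rng), k=2)
          for name, c in centers.items()}
test_utterance = synth_frames(centers["Hue"], 50, rng)
print(identify(test_utterance, models))
```

In a real experiment, the synthetic frames would be replaced by features extracted from recordings, and comparing runs with and without the F0 dimensions would mirror the comparison reported in the abstract.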

Similar resources

The effect of first language (L1) dialects on the identification of Vietnamese word-final stops

This study examined the extent to which speakers’ first language (L1) dialect affects the identification of word-final stops in Vietnamese. Stops in the word-final position are unreleased in Vietnamese. Further, there is a /t/-/k/ merger in the Southern, but not the Northern dialect. We tested the hypothesis that the stop tokens produced in the Southern dialect are identified less accurately th...

Arabic Dialect Identification Using a Parallel Multidialectal Corpus

We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization – a method not previously a...

Globalization, Standardization, and Dialect Leveling in Iran

This paper is an attempt to shed light on the effects of modernization, urbanization, monolingual educational system, and mass media as well as the process of globalization on dialect leveling among Persian dialects. In so doing, the first part of the paper elaborates on the relationship between globalization and sociolinguistics, and on the concept of standardization. Also, it discusses some ...

Dialect experience in Vietnamese tone perception.

This study investigated the perceptual dimensions of tone in Vietnamese and the effect of dialect experience on listeners' prelinguistic perception of tone. While Northern Vietnamese tones are cued by a combination of pitch and voice quality, Southern Vietnamese tones are purely pitch based. 30 listeners from two Vietnamese dialects (10 Northern, 20 Southern) participated in a speeded AX discri...

Advances in Word based Dialect/

In an earlier study, we proposed a very effective dialect/accent classification algorithm named Word based Dialect Classification (WDC). The WDC works well for large corpora and significantly outperforms traditional Large Vocabulary Continuous Speech Recognition (LVCSR) based systems, which are claimed to be the best-performing systems for language identification. For a small train...

Publication date: 2016